An Insight into CoRoLa
The Reference Corpus of Written and Spoken Contemporary Romanian
Dan Tufiș, Dan Cristea
Romanian Academy
The 9th SpeD conference, Bucharest, 6-9 July, 2017

Data collections
• Archive – a repository of readable electronic texts not linked in any coordinated way
• Electronic Text Library (ETL) – a collection of electronic texts in a standardised format, with certain conventions relating to content etc., but without rigorous selectional constraints
• Corpus – a subset of an ETL, built according to explicit design criteria for a specific purpose: a corpus is not a collection of texts which are deemed 'interesting' or 'useful' of themselves; the texts in the corpus are interesting and useful for the study of language
Sue Atkins, Jeremy Clear, Nicholas Ostler – Corpus Design Criteria, 1991

A Life-time Enterprise
• One of the most important decisions of an NLP community is building a reference corpus for the language in question
• It is a scientifically exciting, multidisciplinary project and it has a major cultural dimension
• In a society with strictly regulated IPR, gathering large quantities of text and speech data representative of a language is not an easy task
• It has to be maintained over an indefinite period of time

What is COROLA?
• A priority project of the Romanian Academy (2014-2017)
• The Reference Corpus for Contemporary Romanian Language:
  – Contemporary: since (~)1999
  – Reference: covering all literary language registers and styles
  – 500,000,000 words (~5,000 novel books) and speech data (~300 hours of recordings)
  – deeply processed (morpho-lexically, syntactically and partially semantically), with standard documentation
• Covering all functional styles of the literary language: scientific, official, journalistic and imaginative
• Covering 5 large domains (arts & culture, society, science, nature and others), which are further refined into 71 sub-domains

Foreseen applications
- Linguistic studies (lexicography, terminology, syntax, semantics); educational instrument for students
- Language modeling for automatic processing of Romanian language
- Translation modeling
- Language learning
- Intelligent indexing and multi-criterial retrieval (text & speech)
- Semantic classification of large volumes of data (text & speech)
- Knowledge extraction from data (text & speech)
- Automatic summarization of documents
- Question answering in Romanian language (see Watson – Jeopardy!)
- Automatic speech recognition and synthesis
- Machine Translation (text & speech)

COROLA corpus
What are the levels of linguistic annotation?
Currently: phoneme, syllable, word, POS tagging, syntactic chunking, dependency parsing (treebank prototype)
Foreseen: sense-tagging (based on the Ro-Wordnet sense inventory); discourse mark-up (NE linking, anaphora resolution)
Tools for language data processing:
Currently: TTL, Bermuda, RARE, MIRA, Grammar Studio, MaltParser, yEd, Sphinx, HTK, and many others

The current status (June 2017)

Distribution of textual data

Style             Words
journalistic       77,277,228
science           184,761,720
imaginative        51,617,302
others              2,100,318
memoirs            26,135,623
administrative     11,564,015
law               527,519,345
TOTAL             880,975,551

Domain            Words
arts & culture     27,697,861
society           119,150,171
others            571,986,834
science           160,309,410
nature              1,831,275
TOTAL             880,975,551

Distribution of speech data

Corpus        Type                      Source                   Time length (h:m:s)
RASC          many speakers (read)      RoWikipedia              14:22:02
RSS-ToBI      single speaker (read)     news & fairy tales       03:44:00
RADOR         many speakers             read news & interviews   106:52:33
Radio Iași    many speakers             read news & interviews   under development
Audio-books   single/multiple speaker   read stories             (~200h) (not IPR cleared)
134:57:24 SWARA

COROLA corpus
We concluded agreements with IPR holders for language data to be included in the corpus:
• Humanitas, Polirom, Romanian Academy Publishing House, Bucharest University Press, "Editura Economică", ADENIUM Publishing House, DOXOLOGIA Publishing House, the European Institute Publishing House, GAMA Publishing House, PIM Publishing House (books)
• România literară, Muzica, Actualitatea muzicală, Destine literare, DCNEWS, PRESSONLINE RO, the school magazine of Unirea National College from Focșani (journals, news)
• Bloggers: Simona Tache, Dragoș Bucurenci, Irina Șubredu and Teodora Forăscu
• Oral texts (read news, live transmissions and live interviews) are provided by Rador, the press agency of Radio Romania from Bucharest, and by Radio Iași

What's in a corpus?
A corpus, besides the language data proper, contains additional information on the properties of the texts (written or spoken) that are included.
This is achieved by means of annotation.
Annotation is a principal feature of a corpus, distinguishing it from collections of texts.
Representing this type of information is a matter of standardization, for many good reasons (identification, dissemination, aggregation, interoperability, etc.).
A corpus usually includes two types of annotation:
a) metatextual (information about the text): metadata,
b) linguistic: phonetic, prosodic, morphological, phrasal, syntactic, semantic, pragmatic (not necessarily all)

Annotations
• Inline – traditional annotations (LOB, Penn Treebank, ROMBAC, etc.), most of them based on XML schemas. Meta-information resides in the text files. Adequate for non-overlapping layers and single viewpoints (tokenization, POS tagging, chunking)
• Stand-off – annotations are stored separately from the primary data, leaving the primary data untouched (DeReKo)
• Hybrid annotations – e.g. TCF uses inline annotation for the tokenization and POS tagging layer and stand-off for the syntactic layer

Metadata schema (I)
• Metadata are essential for indexing the corpus and they facilitate the search process for end users
• Metadata for language resources and tools exist in a multitude of formats: we opted to use the CMDI (Component MetaData Infrastructure) metadata format
• CMDI offers ready-made sets of metadata elements (components) for various types of resources; they can be edited, modified, or combined into personalized metadata schemas (profiles)
• The CMDI model has close ties to the ISOcat data category registry

Metadata schema (II)
http://www.clarin.eu/content/component-metadata
When adding an element to a CMDI component, the metadata modeler has to add a link to a Concept Registry (based on the ISOcat data category registry), where very detailed definitions are available. This link provides a persistent and unique identification of the intended semantics.

Metadata schema (III)
Starting from detailed CMDI profiles created in the CLARIN project for annotated text and speech corpora, we have designed profiles tailored to our specific needs:
✓ general information (corpus level): creators of the corpus, the availability and the license, the development status, the projects and cooperation agreements that support the creation, etc.
✓ specific information (document level): the document/article title, collection and publication date, document type, document (literary) style, document domain/sub-domain, the author, the source, annotation details (tools, level of annotation, validation of annotation, etc.), the number of words

Linguistic annotations
Currently: phoneme, syllable, word, POS tagging, syntactic chunking, dependency parsing
Foreseen: sense-tagging (based on the Ro-Wordnet sense inventory); discourse mark-up (NE linking, anaphora resolution)
Tools for language data processing: TTL, Bermuda, RARE, MIRA, MaltParser, TensorFlow, yEd, Sphinx, HTK, and many others

Morpho-lexical processing: example

Oral texts automatic processing
- annotation levels -
• Accompanied by their written counterpart
• Alignment: oral sentence – written sentence
  – Lemmatization
  – Tokenization
  – Part-of-speech tagging
  – Syllabification
  – Some allophones

In-house recording
• Application developed at ITI-Iași (Dr. V. Apopei), coupled with Praat (transcription and turn-taking alignment)
[Diagram: two workflows producing wav recordings (16 bit, 22050 Hz, mono) with metadata: (a) manual, tedious Praat work (transcription + turn-taking alignment); (b) volunteer, transcription (txt, doc), then manual, tedious Praat work (turn-taking alignment)]

In-house sound recordings

Processing speech data

Word-time alignment (start, end, word):
0.00  0.63  [silence]
0.63  1.38  destinderea
1.38  1.48  [silence]
1.48  1.92  r3ce1te
1.92  2.45  aburul
2.45  2.76  [silence]
2.76  3.04  astfel
3.04  3.17  c3
3.17  3.46  poate
3.46  3.78  ap3rea
3.78  3.82  [silence]
3.82  4.27  condensarea
4.27  4.50  unei
4.50  4.83  p3r2i
4.83  5.10  din
5.10  5.42  abur
5.42  5.79  [silence]

Phoneme-time alignment (start, end, phoneme, word):
0         6300000   sil  sil
6300000   7000000   d    destinderea
7000000   7500000   e
7500000   8400000   s
8400000   9300000   t
9300000   9900000   i
9900000   10800000  n
10800000  11200000  d
11200000  11800000  e
11800000  12100000  r
12100000  12600000  e@
12600000  13800000  a
13800000  14800000  sp
14800000  15500000  r    r3ce1te
15500000  16200000  @
16200000  17200000  ch
17200000  17500000  e
17500000  18000000

Corpus Management Platforms
• Corpus management:
  – acquisition of raw data (text, speech)
  – cleaning
  – metadata
  – maintaining
  – access

Processing data: Curator – Provider – Portal
[Diagram: the COROLA portal]

CoDaP: CoRoLa Data cleaning and metadata Platform (http://89.38.230.23/)

Access
• CoRoLa will be open for querying in two environments
  – The IMS Open Corpus Workbench (CWB), http://cwb.sourceforge.net/
  – The KorAP Query interface (IDS Mannheim)
• Both IMS and KorAP are equipped with specific corpus investigation facilities (counting tokens filtered by user-specified criteria, collocation analysis, concordancing, various statistical test batteries, etc.)
• Downloadable to a certain extent

CoRoLa in IMS Open Corpus Workbench (CWB)

CoRoLa in KorAP

The KorAP Query interface
• Suited for management of large corpora (tens of billions of words)
• Easily adaptable to different annotation styles
• Powerful query language:
  – multiple levels
  – query criteria: any field in the metadata and any possible combination of these fields
  – users can build their own virtual corpus: filtered subcorpora (e.g. "texts on architecture published between 2000 and 2005")
• Search results: snippets of a reasonable size for linguistic investigations (1-2 sentences)
• Allows for distributed data (Bucharest, Iași, Mannheim)

KorAP also accesses DeReKo
Deutsches Referenzkorpus (DeReKo)
– at IDS, Mannheim
– the world's largest collection of German texts (>25 billion tokens)
– a broad variety of text types, with a quantitative focus on newspaper texts and a rapidly growing portion of computer-mediated communication

DRuKoLA Objectives (I)
1. Construction and harmonization of comparable corpora in German and Romanian
2. Development of criteria for building comparable virtual sub-corpora from DeReKo and CoRoLa, based on metadata and other possible text properties
3. Exploration of language-specific peculiarities of the studied languages and equivalences with respect to different parameters and structures
4. Some corpus-based comparative case studies on
   a) markers of modality: haben/a avea with zu-infinitives and supine,
   b) (abstract) demonstratives in German and Romanian,
   c) investigation of distributional semantic and syntagmatic properties of corresponding forms and structures

DRuKoLA Objectives (II)
5. Experimentation with and enhancement of a common corpus analysis platform to share the corpus, technical and research results
6. Building a crystallization structure to serve other national or reference corpora, with the long-term goal of pioneering a federated, at least European, reference corpus, where each collection of texts is still physically located at and curated by its responsible institute, but can be dynamically queried and extracted into different comparable corpora

Towards integrating multiple linguistic resources
• Methodology:
  – Common practice: high-level language processors are trained on resources that mix the raw linguistic data with expert annotation
  – Then:
    • use CoRoLa as an anchor to which these other linguistic resources are coupled
    • build an environment that allows complex queries, simultaneously accessing resources of different types

Linguistically Linked Open Data – resources that help to build text processors for CoRoLa
• Thesaurus dictionary of the Romanian language in electronic form
  – eDTLR http://edtlr.info.uaic.ro/
  – train a word sense disambiguation program
• Romanian treebank with more than 10,000 sentences in 2017
  – RoDep Treebank at RACAI and UAIC-FII
  – train a syntactic parser

Linguistically Linked Open Data – resources that help to build text processors for COROLA
• A semantically annotated treebank
  – correct syntactic roles on semantic grounds
Thanks: Cătălina Mărănduc, PhD report, Nov. 2016

Linguistically Linked Open Data – resources that help to build text processors for COROLA
• A corpus of semantic relations
  – QuoVadis http://nlptools.info.uaic.ro/Resources.jsp
  – train a program to recognise coreference, affective, kinship and social relations

Continuation of CoRoLa: a diachronic corpus
• Acquisition of textual data
  – including with the help of a Cyrillic OCR (CyRo – a project in the second phase of evaluation)
• Infer the paradigmatic morphology of old Romanian
  – from eDTLR citations and other sources
• JUST IMAGINE: manuscripts → scanned → OCRed → transcribed → POS-tagged etc. → included in the diachronic corpus

Further prospects
• EuReCo – the idea of a European Reference Corpus
Current situation
  – several national initiatives loosely connected by bilateral contacts
  – co-operation within CLARIN, but subordinated to various other goals and funding necessities
  – coordination via EFNIL, so far mostly unrelated to corpora
  – some initiatives maintain their own parallel or comparable corpora
• Joining forces
  – particularly desirable for comparable corpora; several national and reference corpora are built and maintained anyway
  – creating methodology and techniques for joining them virtually
    • each national centre still responsible for its language
    • each corpus still physically located at its centre

Acknowledgements
• To our colleagues involved in CoRoLa:
  – from AR-RACAI: Verginica Barbu-Mititelu, Tibi Boroș, Ștefan Dumitrescu, Radu Ion, Elena Irimia
  – from AR-IIT: Vasile Apopei, Cecilia Bolea, Daniela Gîfu, Alex Moruz, Mihaela Onofrei, Laura Pistol, Andrei Scutelnicu
• To our collaborators in DRuKoLA:
  – from Univ. Bucharest: Ruxandra Cosma
  – from IDS Mannheim: Nils Diewald, Marc Kupietz, Eliza Margaretha, Andreas Witt
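
Supplementary sketch for the "Processing speech data" slide: the phoneme-level listing there appears to use integer times in 100 ns units, as HTK label files do, which is consistent with the word-level times in seconds (6300000 corresponds to 0.63 s). The Python below is only an illustration under that assumption; the row layout "start end phoneme [word]", the pause labels `sil`/`sp`, and the names `parse_labels` and `word_spans` are assumptions made for this example, not part of the CoRoLa tool chain.

```python
from typing import List, NamedTuple, Optional, Tuple

# Assumption: times are counted in 100 ns units, as in HTK label files.
UNITS_PER_SECOND = 10_000_000
# Assumption: these phoneme labels mark pauses rather than speech.
PAUSES = {"sil", "sp"}

# Sample rows in the layout "start end phoneme [word]".
SAMPLE = """\
0 6300000 sil sil
6300000 7000000 d destinderea
7000000 7500000 e
7500000 8400000 s
8400000 9300000 t
9300000 9900000 i
9900000 10800000 n
10800000 11200000 d
11200000 11800000 e
11800000 12100000 r
12100000 12600000 e@
12600000 13800000 a
13800000 14800000 sp
"""


class Segment(NamedTuple):
    start: float          # seconds
    end: float            # seconds
    phone: str
    word: Optional[str]   # present only on the first phoneme of a word


def parse_labels(text: str) -> List[Segment]:
    """Parse label rows and convert their times to seconds."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete rows
        start, end = int(fields[0]), int(fields[1])
        word = fields[3] if len(fields) > 3 else None
        segments.append(
            Segment(start / UNITS_PER_SECOND, end / UNITS_PER_SECOND, fields[2], word)
        )
    return segments


def word_spans(segments: List[Segment]) -> List[Tuple[str, float, float]]:
    """Merge phoneme segments into word-level spans; pauses become '[silence]'."""
    spans: List[list] = []
    current = None
    for seg in segments:
        if seg.phone in PAUSES:
            current = None  # a pause closes the word in progress
            spans.append(["[silence]", seg.start, seg.end])
        elif seg.word is not None or current is None:
            current = [seg.word or "?", seg.start, seg.end]
            spans.append(current)
        else:
            current[2] = seg.end  # extend the current word to this phoneme
    return [(w, s, e) for w, s, e in spans]


if __name__ == "__main__":
    for word, start, end in word_spans(parse_labels(SAMPLE)):
        print(f"{start:.2f} {end:.2f} {word}")
```

On the sample rows, the merged spans match the first three entries of the word-level listing (silence, "destinderea", silence), which is what makes the 100 ns reading plausible.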